Conversation

@rahil-c (Collaborator) commented Apr 12, 2023

Change Logs

Describe context and summary for this change. Highlight if any code was copied.

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change.

  • The config description must be updated if new configs are added or the default value of the configs is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua self-assigned this Apr 12, 2023
@yihua (Contributor) left a comment

LGTM since most of the changes update API usage only. Is there any compatibility issue with different Spark, Flink and Hadoop versions?

  <groupId>com.amazonaws</groupId>
  <artifactId>dynamodb-lock-client</artifactId>
- <version>${dynamodb.lockclient.version}</version>
+ <version>1.2.0</version>
Contributor

nit: could you keep the version variable?

Collaborator Author

As in we do not specify a version and let it pull in the latest version?

- return AmazonCloudWatchAsyncClientBuilder.standard()
-     .withCredentials(HoodieAWSCredentialsProviderFactory.getAwsCredentialsProvider(props))
+ private static CloudWatchAsyncClient getAmazonCloudWatchClient(Properties props) {
+   return CloudWatchAsyncClient.builder()
Contributor

What did .standard() do before? Is that behavior the default in the builder now?

Collaborator Author

https://github.com/aws/aws-sdk-java-v2/blob/master/docs/LaunchChangelog.md#4-service-changes

If you look at the table for several of these AWS clients, it shows a "before" column where .standard() was used in 1.x; in 2.x it is replaced by the .builder() pattern. See this snippet taken from the link above:

> Client builders no longer contain static methods. The static methods on the clients must be used: AmazonDynamoDBClientBuilder.defaultClient is now DynamoDbClient.create and AmazonDynamoDBClientBuilder.standard is now DynamoDbClient.builder.
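
For illustration, a minimal sketch of that builder change using the CloudWatch client touched in this PR (only the SDK types and methods are real; the wrapper class is hypothetical):

  import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
  import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;

  public class CloudWatchClientSketch {
    // v1: AmazonCloudWatchAsyncClientBuilder.standard() returned a builder
    // pre-configured with the default region and credential chain:
    //   AmazonCloudWatchAsyncClientBuilder.standard()
    //       .withCredentials(credentialsProvider)
    //       .build();
    //
    // v2: there is no standard(); builder() applies the same defaults
    // unless they are explicitly overridden.
    static CloudWatchAsyncClient buildClient() {
      return CloudWatchAsyncClient.builder()
          .credentialsProvider(DefaultCredentialsProvider.create())
          .build();
    }
  }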

StorageDescriptor partitionSD = sd.copy(copySd -> copySd.columns(newColumns));
final Instant now = Instant.now();
TableInput updatedTableInput = TableInput.builder()
.tableType(table.tableType())
Contributor

Is the table name missing?

Collaborator Author

Thanks for catching this. I will fix it and check this class once more, just to be safe against any other misses like this.
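
For context, a minimal sketch of what the fix could look like (the wrapper class and method are hypothetical; Table and TableInput are real Glue SDK v2 model types):

  import software.amazon.awssdk.services.glue.model.Table;
  import software.amazon.awssdk.services.glue.model.TableInput;

  class TableInputSketch {
    // TableInput.builder() starts from an empty builder, so every field,
    // including the name, must be carried over from the existing Table.
    static TableInput withNameCarriedOver(Table table) {
      return TableInput.builder()
          .name(table.name())           // the field flagged as missing above
          .tableType(table.tableType())
          .build();
    }
  }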

- .withDatabaseName(databaseName)
- .withTableInput(updatedTableInput);
+ StorageDescriptor sd = table.storageDescriptor();
+ StorageDescriptor partitionSD = sd.copy(copySd -> copySd.columns(newColumns));
Contributor

Here it makes a copy instead of an in-place change. Any reason for doing this?

Collaborator Author

In SDK v2 the POJOs are immutable, so I can't modify the original StorageDescriptor sd object; the only option is to make a clone and pass that.

From the docs, https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/migration-whats-different.html:

> Immutable POJOs
> Clients and operation request and response objects are now immutable and cannot be changed after creation. To reuse a request or response variable, you must build a new object to assign to it.
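
A minimal sketch of the copy-on-write pattern this enforces (the wrapper class is hypothetical; copy(...) and toBuilder() are real methods on v2 model objects such as StorageDescriptor):

  import java.util.List;
  import software.amazon.awssdk.services.glue.model.Column;
  import software.amazon.awssdk.services.glue.model.StorageDescriptor;

  class ImmutablePojoSketch {
    // v2 model objects cannot be mutated in place; copy(...) clones the
    // object through its builder, applies the changes, and returns a new
    // instance. Equivalent long form: sd.toBuilder().columns(newColumns).build()
    static StorageDescriptor replaceColumns(StorageDescriptor sd, List<Column> newColumns) {
      return sd.copy(builder -> builder.columns(newColumns));
    }
  }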

* // ... start making calls to table ...
* </pre>
*/
public class DynamoTableUtils {
Contributor

Why do we have to copy the code from the library?

Collaborator Author

Yes, unfortunately this is an issue.

Basically, in Hudi we were using the TableUtils class, which was a src class in v1 that we invoked in the DynamoDB locking feature, but in v2 this class has been moved to a test class; see this commit: aws/aws-sdk-java-v2@973237c.

Because of this we can no longer access the class due to its test scope. I didn't see any replacement DynamoDB utility-like classes in SDK v2.

So currently I opted to copy the class over and left a comment pointing to where this code can be found, so we don't have to reinvent it here. Let me know if you have any other ideas on how we can go about this.

Comment on lines +139 to +150

<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>${aws.sdk.httpclient.version}</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>${aws.sdk.httpcore.version}</version>
</dependency>

Contributor

The hudi-aws module already adds these, so there is no need to add both dependencies again?

Collaborator Author

Makes sense, will remove these and run the dependency tree to confirm.

Comment on lines +471 to +478
<!-- AWS Services -->
<!-- https://mvnrepository.com/artifact/software.amazon.awssdk/aws-java-sdk-sqs -->
<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>sqs</artifactId>
<version>${aws.sdk.version}</version>
</dependency>

Contributor

Should this be in the hudi-aws module? Why is it required now?

Collaborator Author

It seems that the event source classes inside the hudi-utilities module will require this dependency:
https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsSource.java

But I see what you mean: we don't have to place this dependency inside the utilities pom; instead we can place it in the hudi-aws pom: https://github.com/apache/hudi/blob/master/hudi-aws/pom.xml#L144

Comment on lines +156 to +159
<include>org.apache.hudi:hudi-aws</include>
<include>software.amazon.awssdk:*</include>
<include>org.apache.httpcomponents:httpclient</include>
<include>org.apache.httpcomponents:httpcore</include>
Contributor

We should avoid this. If the AWS ecosystem is used, hudi-aws-bundle should be used with hudi-utilities-bundle: https://hudi.apache.org/releases/release-0.12.2#bundle-updates.

Collaborator Author

So we basically remove all of this from the utilities bundle, and then it's mandatory that the user pass both the utilities bundle and the aws bundle?

Collaborator Author

So we can remove this change here, and the user will pass the bundle then.

@yihua added the dependencies (Dependency updates) and aws-support labels Apr 12, 2023
@hudi-bot (Collaborator)

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@rahil-c (Collaborator Author) commented Apr 19, 2023

@yihua

> LGTM since most of the changes update API usage only. Is there any compatibility issue with different Spark, Flink and Hadoop versions?

So far I have only tested with Spark 3 and Hadoop 3, and I manually tested the AWS-related Hudi features like OCC, GlueSyncTool, and CloudWatchMetricsReporter on EMR.

I can try running some tests with Flink, but I am curious which version combinations (Hadoop, Spark, Flink) I should test before we land this.

@yihua added the priority:blocker (Production down; release blocker) and release-0.14.0 labels Jul 6, 2023
@yihua (Contributor) commented Aug 3, 2023

> So far I have only tested with Spark 3 and Hadoop 3, and I manually tested the AWS-related Hudi features like OCC, GlueSyncTool, and CloudWatchMetricsReporter on EMR.
>
> I can try running some tests with Flink, but I am curious which version combinations (Hadoop, Spark, Flink) I should test before we land this.

We synced offline some time ago: we should test Spark 2.4 and Spark 3.x with Hadoop 2 on S3 to make sure there is no compatibility issue, especially around the S3A file system.

@yihua (Contributor) commented Aug 3, 2023

Closing this in favor of #9347.

@yihua closed this Aug 3, 2023

Labels

big-needle-movers, dependencies (Dependency updates), priority:blocker (Production down; release blocker), release-0.14.0

Projects

Status: ✅ Done


3 participants